fix: separate byte and character limits in BQ plugin GCS text offload #5565

Draft
caohy1988 wants to merge 3 commits into google:main from caohy1988:fix/bqaa-offload-unit-mismatch

caohy1988 commented May 1, 2026

Summary

Fixes #5561. Stacked on PR #5528 (fork detection fix).

The GCS text offload decision in HybridContentParser._parse_content_object mixed byte-based and character-based limits in a single min() comparison, producing wrong offload decisions for multi-byte text.

Problem

# Before: mixed-unit comparison
text_len = len(part.text.encode("utf-8"))  # BYTES
offload_threshold = self.inline_text_limit  # 32KB — bytes
if self.max_length != -1 and self.max_length < offload_threshold:
    offload_threshold = self.max_length     # characters!
if self.offloader and text_len > offload_threshold:  # bytes vs ???

When max_content_length < inline_text_limit, the threshold becomes a character count that is then compared against a byte measurement. Example: 3K emoji characters (12K UTF-8 bytes) with max_length=10000 are under both real limits (3,000 characters vs. 10,000; 12,000 bytes vs. 32,768), but the old code computed min(32768, 10000) = 10000, and 12,000 bytes > 10000 triggered a false offload.
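The arithmetic in this example can be checked with a small standalone sketch. `old_should_offload` is a hypothetical stand-in for the pre-fix logic, written as a free function for illustration; it is not the plugin's actual method, and the limits are passed in as plain arguments.

```python
def old_should_offload(text: str, inline_text_limit: int, max_length: int) -> bool:
    """Hypothetical stand-in for the pre-fix decision (mixed units)."""
    text_len = len(text.encode("utf-8"))    # measured in bytes
    offload_threshold = inline_text_limit   # bytes
    if max_length != -1 and max_length < offload_threshold:
        offload_threshold = max_length      # characters, silently reused as bytes
    return text_len > offload_threshold


emoji_text = "\U0001F600" * 3000            # 3K emoji, 4 UTF-8 bytes each

char_len = len(emoji_text)                  # 3000 characters, under max_length
byte_len = len(emoji_text.encode("utf-8"))  # 12000 bytes, under the 32KB limit

# Both real limits are respected, yet the mixed-unit threshold says "offload".
falsely_offloaded = old_should_offload(emoji_text, 32768, 10000)
print(char_len, byte_len, falsely_offloaded)
```

The text is within both limits when each is read in its own unit; the false positive comes only from comparing 12,000 bytes against a threshold of 10,000 characters.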

Fix

Evaluate each limit in its own unit — no mixed min():

char_len = len(part.text)
byte_len = len(part.text.encode("utf-8"))

exceeds_inline_byte_limit = byte_len > self.inline_text_limit
exceeds_char_limit = (
    self.max_length != -1 and char_len > self.max_length
)

if self.offloader and (exceeds_inline_byte_limit or exceeds_char_limit):

  • inline_text_limit (32KB): controls inline storage size, measured in bytes
  • max_content_length: controls truncation, measured in characters
  • Text is offloaded if either limit is exceeded
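As a standalone illustration of the rule in the bullets above, the sketch below restates the fixed decision as a free function. `should_offload` and its parameter defaults are hypothetical names chosen for this example; in the PR the logic lives inside HybridContentParser._parse_content_object and additionally requires self.offloader to be set.

```python
def should_offload(
    text: str,
    inline_text_limit: int = 32768,  # bytes (the 32KB inline limit)
    max_length: int = -1,            # characters; -1 disables the check
) -> bool:
    """Return True if either limit is exceeded, each checked in its own unit."""
    char_len = len(text)                  # Unicode code points
    byte_len = len(text.encode("utf-8"))  # UTF-8 encoded size

    exceeds_inline_byte_limit = byte_len > inline_text_limit
    exceeds_char_limit = max_length != -1 and char_len > max_length
    return exceeds_inline_byte_limit or exceeds_char_limit


emoji = "\U0001F600"  # one character, four UTF-8 bytes

under_both = should_offload(emoji * 3_000, max_length=10_000)   # 3K chars, 12K bytes
over_bytes = should_offload(emoji * 10_000, max_length=-1)      # 10K chars, 40K bytes
over_chars = should_offload("a" * 15_000, max_length=10_000)    # 15K chars, 15K bytes
```

Because the two predicates are computed separately and only combined with `or`, no character count is ever compared against a byte count.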

Test plan

  • 220 tests pass (213 existing + 2 fork detection + 5 offload), 0 regressions
  • test_multibyte_text_offloaded_by_byte_limit — 10K emoji (40KB UTF-8) offloaded via byte limit
  • test_ascii_under_both_limits_stays_inline — small ASCII stays inline
  • test_text_exceeding_char_limit_offloaded — ASCII over char limit but under byte limit is offloaded
  • test_no_offloader_falls_back_to_truncate — without offloader, truncates inline
  • test_multibyte_under_char_and_byte_limits_stays_inline (regression test): 3K emoji (12K bytes) with max_length=10000 stays inline (old code falsely offloaded)
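The five offload scenarios can be sketched as a table-driven check against a simplified model. Everything here is an assumption for illustration: `decide` is a hypothetical function, not the plugin's API; the ASCII input sizes and the max_length used for the 10K-emoji case are made up; and the "truncate" outcome models only the fact that text over the character limit is truncated inline when no offloader is configured.

```python
INLINE_TEXT_LIMIT = 32768  # bytes (the 32KB inline limit)


def decide(text: str, has_offloader: bool, max_length: int) -> str:
    """Hypothetical model of the outcome: 'offload', 'truncate', or 'inline'."""
    over_bytes = len(text.encode("utf-8")) > INLINE_TEXT_LIMIT
    over_chars = max_length != -1 and len(text) > max_length
    if has_offloader and (over_bytes or over_chars):
        return "offload"
    if over_chars:
        return "truncate"  # assumption: no offloader, so fall back to truncation
    return "inline"


EMOJI = "\U0001F600"  # 4 UTF-8 bytes per character

# (scenario, text, has_offloader, max_length, expected outcome)
scenarios = [
    ("multibyte_text_offloaded_by_byte_limit", EMOJI * 10_000, True, -1, "offload"),
    ("ascii_under_both_limits_stays_inline", "hello", True, 10_000, "inline"),
    ("text_exceeding_char_limit_offloaded", "a" * 15_000, True, 10_000, "offload"),
    ("no_offloader_falls_back_to_truncate", "a" * 15_000, False, 10_000, "truncate"),
    ("multibyte_under_char_and_byte_limits_stays_inline", EMOJI * 3_000, True, 10_000, "inline"),
]

results = {
    name: decide(text, has_offloader, max_length)
    for name, text, has_offloader, max_length, _ in scenarios
}
for name, *_, expected in scenarios:
    assert results[name] == expected, name
```

The last row is the regression case: it is the only scenario where the old mixed-unit threshold and the per-unit checks disagree.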

🤖 Generated with Claude Code

caohy1988 and others added 2 commits April 28, 2026 11:23
…plugin

When the plugin is deployed via Vertex AI Agent Engine, it is pickled
for transport and unpickled on the server.  __getstate__ sets
_init_pid = 0 as a pickle sentinel.  On the server, _ensure_started()
checks os.getpid() != self._init_pid, which always evaluates to True
since os.getpid() is never 0.  This triggers _reset_runtime_state()
on every cold start even though no fork happened, producing a
misleading "Fork detected (parent PID 0, child PID xx)" warning and
adding unnecessary startup latency from tearing down and re-creating
gRPC state that was already clear.

The fix distinguishes "unpickled, never initialized" (_init_pid == 0)
from "forked from a different process" (_init_pid != 0 and
_init_pid != os.getpid()).  Real forks are still detected by both
os.register_at_fork (line 108) and this PID check.

Related: haiyuan-eng-google/BigQuery-Agent-Analytics-SDK#86

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

After _lazy_setup succeeds, set _init_pid = os.getpid() when it was
the pickle sentinel (0).  Without this, an unpickled plugin keeps
_init_pid == 0 forever, disabling the PID-based fork check for the
rest of the instance's lifetime.

Also fix test_reset_on_real_fork to use max(os.getpid() - 1, 1)
instead of hardcoded 99999.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
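The PID bookkeeping described in the two commit messages above can be sketched with a toy class. `PluginSketch` is a hypothetical stand-in that models only the sentinel and fork check; the method names mirror those mentioned in the commit messages, but their bodies, the `resets` counter, and the re-stamping of `_init_pid` after a reset are assumptions made so the sketch is self-contained.

```python
import os


class PluginSketch:
    """Hypothetical stand-in; only the PID sentinel logic is modeled."""

    def __init__(self):
        self._init_pid = os.getpid()
        self._started = False
        self.resets = 0  # illustration only: counts runtime-state resets

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_init_pid"] = 0     # pickle sentinel: "unpickled, never initialized"
        state["_started"] = False
        return state

    def _reset_runtime_state(self):
        self.resets += 1
        self._started = False

    def _lazy_setup(self):
        self._started = True

    def _ensure_started(self):
        current = os.getpid()
        # Only a nonzero, different PID means a real fork.
        if self._init_pid != 0 and self._init_pid != current:
            self._reset_runtime_state()
            self._init_pid = current  # assumption: re-stamp after a reset
        if not self._started:
            self._lazy_setup()
            if self._init_pid == 0:
                # Replace the sentinel so later forks are still detected.
                self._init_pid = current


# Simulate a pickle round trip by hand (default unpickling restores __dict__).
original = PluginSketch()
restored = PluginSketch.__new__(PluginSketch)
restored.__dict__.update(original.__getstate__())
restored._ensure_started()    # no reset: sentinel 0 is not treated as a fork

# Simulate a fork by pretending the instance was started under another PID.
forked = PluginSketch()
forked._ensure_started()
forked._init_pid = os.getpid() + 1  # any nonzero PID different from ours
forked._ensure_started()            # reset: real PID mismatch
```

In this model the unpickled instance starts without a reset and ends up with `_init_pid == os.getpid()`, while the instance with a mismatched nonzero PID is reset exactly once.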

google-cla Bot commented May 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label May 1, 2026
Collaborator

adk-bot commented May 1, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). You can find more information at https://cla.developers.google.com/.

Thanks!

The GCS text offload decision mixed byte-based and character-based
limits in a single min() comparison.  inline_text_limit (32KB) is a
byte-based storage guard, while max_content_length is a character-
based truncation limit.  Computing min(bytes, chars) produced wrong
offload decisions for multi-byte text (CJK, emoji).

The fix evaluates each limit in its own unit:
- inline_text_limit: compared against UTF-8 byte length
- max_content_length: compared against character count
Text is offloaded if either limit is exceeded.

Includes regression test for the specific google#5561 case: 3K emoji chars
(12K bytes) with max_length=10000 — under both real limits but
falsely offloaded by the old mixed-unit min().

Fixes google#5561

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@caohy1988 caohy1988 force-pushed the fix/bqaa-offload-unit-mismatch branch from 7b6e5ef to 040b479 Compare May 1, 2026 06:48
Development

Successfully merging this pull request may close these issues.

fix: byte/character unit mismatch in BigQuery analytics plugin GCS text offload

2 participants